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The concept of feature selectivity in sensory signal processing can be formalized as dimensionality 
reduction: in a stimulus space of very high dimensions, neurons respond only to variations within 
some smaller, relevant subspace. But if neural responses exhibit invariances, then the relevant 
subspace typically cannot be reached by a Euclidean projection of the original stimulus. We argue 
that, in several cases, we can make progress by appealing to the simplest nonlinear construction, 
identifying the relevant variables as quadratic forms, or "stimulus energies." Natural examples 
include non-phase-locked cells in the auditory system, complex cells in visual cortex, and motion- 
sensitive neurons in the visual system. Generalizing the idea of maximally informative dimensions, 
we show that one can search for the kernels of the relevant quadratic forms by maximizing the 
mutual information between the stimulus energy and the arrival times of action potentials. Simple 
implementations of this idea successfully recover the underlying properties of model neurons even 
when the number of parameters in the kernel is comparable to the number of action potentials 
and stimuli are completely natural. We explore several generalizations that allow us to incorporate 
plausible structure into the kernel and thereby restrict the number of parameters. We hope that 
this approach will add significantly to the set of tools available for the analysis of neural responses 
to complex, naturalistic stimuli. 



I. INTRODUCTION 

A central concept in neuroscience is feature selectiv- 
ity: as our senses are bombarded by complex, dynamic 
inputs, individual neurons respond to specific, identifi- 
able components of these data [TJ [2] . Neurons early in a 
processing pathway are thought to be sensitive to simpler 
features [2 0] , and one can think of subsequent stages of 
processing as computing conjunctions of these features, 
so that neurons later in the pathway respond to more 
complex structures in the sensory world [5 j. A major 
challenge for theory is to make this intuition mathemat- 
ically precise, and to use such a precise formulation to 
build tools that allow us to analyze real neurons as they 
respond to naturalistic inputs. There is a long history of 
such work, but much of it rests on the identification of 
"features" with filters or templates. Filtering is a linear 
operation, and matching to a template can be thought 
of as a cascade of linear and nonlinear steps. As we will 
see, however, there are many examples of neural feature 
selectivity, well known from experiments on visual and 
auditory systems in many organisms, for which such a 
description in linear terms does not lead to much simpli- 
fication. 

In this paper we use examples to motivate the sim- 
plest nonlinear definition of a feature, in which the rele- 
vant variable is a quadratic form in stimulus space. Be- 
cause the resulting variable is connected to the "energy in 
frequency bands" for auditory signals, we refer to these 
quadratic forms as "stimulus energies." To be useful, we 
have to be able to identify these structures in experi- 
ments where neurons are driven by complex, naturalistic 
inputs. We show that, generalizing the idea of maximally 
informative dimensions [6], we can find the maximally 



informative stimulus energies using methods that don't 
require special assumptions about the structure of the 
input stimulus ensemble. We illustrate these ideas on 
model neurons, and explore the amount of data that will 
be needed to use these methods in the analysis of real 



II. MOTIVATION 

To motivate the problems that we address, let us start 
by thinking about an example from the auditory system. 
This starting point is faithful to the history of our sub- 
ject, since modern approaches for estimating receptive 
fields and filters have their origins in the classic work 
of de Boer and coworkers on the "reverse correlation" 
method [7], which was aimed at separating the filtering 
of acoustic signals by the inner ear from the nonlinearities 
of spike generation in primary auditory neurons. We will 
see that mathematically identical problems arise in think- 
ing about complex cells in visual cortex, motion sensitive 
neurons throughout the visual pathway, and presumably 
in other problems as well. 

We begin with the simplest model of an auditory neu- 
ron. If the sound pressure as a function of time is it 
is plausible that the activity of a neuron is controlled by 
some filtered version of this stimulus, so that the proba- 
bility per unit time of generating a spike is 



r(t) r g 
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dTf(r)s(t-r) 



(1) 



where f(r) is the relevant temporal filter and g[-] is a 
nonlinearity; the spikes occur at times t\. The statement 
that neurons are tuned is that if we look at the filter in 
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Fourier space, 



/»= / dtf(t)e 



iujt 



(2) 



then the magnitude of the filter response, |/(cj)|, has a 
relatively sharp peak near some characteristic frequency 
uj c . If we choose the stimulus waveforms from a Gaussian 
white noise ensemble, then the key result of reverse cor- 
relation is that if we compute the average stimulus in the 
neighborhood of a spike, we will recover the underlying 
filter, independent of the nonlinearity, 



(*(t-*i)>«/(-t). 



(3) 



We emphasize that this is a theorem, not a heuristic data 
analysis method. If the conditions of the theorem are 
met, then this analysis is guaranteed to give the right 
answer in the limit of large amounts of data. If the condi- 
tions of the theorem are not met, then the spike-triggered 
average stimulus need not correspond to any particular 
characteristic of the neuron. 

Eq. is an example of dimensionality reduction. In 
principle, the neuron's response at time t can be deter- 
mined by the entire history of the stimulus for times 
t' < t. Let us suppose that we sample (and generate) 
the stimulus in discrete time steps spaced by dt. Then 
the stimulus history is a list of numbers 

s t = {s(i), s(t - dt), s(t - 2dt), • • • , s(t - Ddt)}, (4) 

where D is the effective stimulus dimensionality, set by 
D = T/dt, and T the longest plausible estimate of the 
integration time for the neural response. We can think 
of s t as a D-dimensional vector. If we know that the 
neural response is controlled by a linearly filtered ver- 
sion of the sound pressure stimulus, even followed by an 
arbitrary nonlinearity, then only one direction in this D- 
dimensional space matters for the neuron. Further this 
really is a "direction," since we can write the response 
as the Euclidean projection of s onto one axis, or equiv- 
alently the dot product between s and a vector W, 



r(t)=r g(W-s t ) : 



(5) 



where 



W = dt x {/(0), f(dt), f(2dt), • • • , f(T)}. (6) 



This explicit formulation in terms of dimensionality re- 
duction suggests a natural generalization in which several 
dimensions, rather than just one, are relevant, 

r(t) = r g(W 1 ■ s t , W 2 • s t , ■ ■ ■ , W K • * t ). (7) 

As long as we have K <C D, it still holds true that the 
neuron responds only to some limited set of stimulus di- 
mensions, but this number is not as small as in the sim- 
plest model of a single filter. 

Notice that if an auditory neuron responds accord- 
ing to Eq ([!]), then it will exhibit "phase locking" to 



periodic stimuli. Specifically, if s(t) = Acos(uot) and 
f(uj) = |/»|e+^, then r(t) = r g[A\f(uj)\cos(ujt - </>)]. 
So long as there is a nonzero response to the stimulus, 
this response will be modulated at the stimulus frequency 
co>, and more generally if we plot the spiking probability 
versus time measured by the phase ip = cot of the stim- 
ulus oscillation, then the probability will vary with, or 
"lock" to this phase. 

While almost all auditory neurons are tuned, not all 
exhibit phase locking. We often summarize the behav- 
ior of tuned, non-phase-locked neurons by saying that 
they respond to the power in a given bandwidth or to 
the envelope of the signal at the output of a filter. The 
simplest model for such behavior, which has its roots 
in our understanding of hair cell responses [SHU], is to 
imagine that the output of a linear filter passes through a 
weak nonlinearity, then another filter. The second stage 
of filtering is low-pass, and will strongly attenuate any 
signals at or near the characteristic frequency uj c . Then, 
to lowest order, the neuron's response depends on 



Pit) 
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dt'h{t 



t'W) 



(8) 



where f\ is the bandpass filter that determines the tuning 
of the neuron and fi is a smoothing filter which ensures 
that the cell responds to the power in its preferred fre- 
quency band while rejecting any temporal variation at or 
above the "carrier" frequency. The probability of spik- 
ing depends on this power p(t) through a nonlinearity, as 
before, 



it)=r g\p(t)}. 



(9) 



Intuitively, this simple model for a non-phase-locked 
neuron also represents a substantial reduction in dimen- 
sionality - all that matters is the power passing through 
a given frequency band, defined by the filter f\. On the 
other hand, we cannot collapse this model into the one di- 
mensional form of Eq. (J5|. To be concrete, suppose that 
the filter f\ has a relatively narrow bandwidth around its 
characteristic frequency uj c . Then we can write 



/i(r) = A(t) s'm(uj c T + </>), 



(10) 



where the amplitude A(r) varies slowly compared to the 
period of the oscillation. Let us denote by t\ the tempo- 
ral width of A(r), since this corresponds to the time over 
which the filter f\ integrates the stimulus, and similarly 
T2 will denote the temporal width of fi{j). To make 
sure that the power p(t) does not oscillate at (twice) 
the frequency cj c , we must force T2 ^> 2tt/uj c . But we 
still have two possibilities, (a) t\ ^> T2 ^> uj c and (b) 
T2 ^> Ti ^> uj c . If (a) is true, we can show that p(t) is the 
Pythagorean sum of the outputs of two filters that form 
a quadrature pair, 



p(t) 



dt'f a (t 



t'W) 
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FIG. 1. A non— phase— locked auditory neuron, (a) In this implementation a model neuron responds to a white noise 
stimulus, (b) The stimulus s is filtered through a temporal filter /i, which has the form t sm(ojt) exp(— t/n) where uj = 
2tt x 10 3 Hz and r\ — 3 ms. (c) The output of fi is shown here. The filter fi is narrow band, therefore the output oscillates 
at the characteristic frequency even when the input is white, (d) The output of fi is first squared, and in (e), convolved with 
a second filter of the form £exp(— £/t2), with a smoothing time constant T2 = 1 ms. (f) The normalized signal is finally 
thresholded to generate spikes at a mean rate that allows the spiking probability to be 0.1 per time step. We assume that time 
runs discretely in steps of dt = 1/20000 s. 
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dt'f (t 



t'W) 



where 



f a (T) fa ,4(t) sin(w c T - 
fp(r) fa A(t) cos(w c t ■ 



(11) 



(12) 
(13) 



On the other hand, if (b) is true, there is no simple de- 
composition, and the minimum number of dimensions 
that we need to describe this model is K ~ Tijdt, which 
can be quite large. 

We can measure the number of relevant dimensions 
using the spike-triggered covariance matrix. Specifically, 
if the stimulus vector s t has components si(t), then we 
can form the matrix 

ACij = ( Si (*) Sj (*)) f=Wke - <Si(*)*j(*)>, (14) 

where in the first term we average over the arrival time 
of the spikes and in the second term we average over all 
time. If the spiking probability behaves as in Eq. ([7]), and 
we choose the stimulus from a Gaussian ensemble, then 
AC has exactly K nonzero eig envalues [121IT3] . In Fig.[l] 



we schematize the model auditory neuron we have been 
describing, and in Fig. [2] we show the spike-triggered co- 
variance analysis for models in the two limits, T2 <C t\ 
and T2 ^> r\. Indeed we find that in the first case there 
are just two relevant dimensions, a quadrature pair of 
filters, whereas in the second case there are many rele- 
vant dimensions; these dimensions appear as temporally 
shifted and orthogonalized copies of the filter fi (t) . 

We can think of a neuron that does not phase lock 
as having an invariance: it responds to acoustic wave- 
forms that have energy in a relevant bandwidth near 
cj c , but it doesn't discriminate among signals that are 
shifted by small times. This invariance means that the 
cell is not just sensitive to one dimension of the stimulus, 
but to many, although these different dimensions corre- 
spond, in effect, to the same stimulus feature occurring 
at different times relative to the spike. Thus, we have 
a conflict between the notion of a "single feature" and 
the mathematical description of a "single dimension" via 
linear projection. The challenge is to provide a math- 
ematical formulation that better captures our intuition. 
Before presenting a possible solution, let's see how the 
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Short smoothing time : Ti > T2 

(a) Spike-triggered covariance matrix (b) STC eigenvalue spectrum (c) Reconstructed filters 




Long smoothing time : T-| < %2 

(d) Spike-triggered covariance matrix (e) STC eigenvalue spectrum (f ) Reconstructed filters 




FIG. 2. Covariance analysis of the non— phase— locked auditory neuron, (a) If the time constant of the smoothing 
filter f 2 is much shorter than that of the filter /1, the spike-triggered covariance matrix has a relatively simple structure, (b) 
Diagonalizing this covariance matrix yields 2 leading eigenvalues (the rest remain close to 0). (c) The eigenvectors corresponding 
to the 2 non-zero eigenvalues are the reconstructed filters plotted here. These form a quadrature pair, as shown in the inset, 
(d) If the smoothing time of the filter f 2 is larger than that of the first, the covariance matrix has a much richer structure, (e) 
The spike-triggered covariance matrix decomposes into multiple non-unique eigenvalues, (f) The eigenvectors corresponding 
to the non-zero eigenvalues give multiple time-shifted copies of the same filter. 



same problem arises in other cases. 

Since the classical work of Hubel and Wiesel [TJJ [15] , 
we know that cells in the primary visual cortex can be 
classified as "simple" and "complex." Although Hubel 
and Wiesel did not give a mathematical description of 
their data, in subsequent work, simple cells often have 
been described in the same way that we described the 
simplest auditory neuron in Eq [16 . If the light inten- 
sity falling on the retina varies in space (x) and time (t) 
as /(x, t), we can define a spatiotemporal receptive field 
F(x, r) and approximate the probability that a simple 
cell generates a spike per unit time as 



r(t) =r g 



J d 2 x J dr F(x,r) I(x,t) 



(15) 



If, as before, we assume that the stimulus is generated in 
discrete time steps (movie frames) with spacing dt, and 
that the stimulus influences spikes only within some time 
window of duration T, then we can think of the stimulus 
at any moment in time as being the T/dt frames of the 



movie preceding that moment, 

s t = {/(x, t), I(x, t - dt), J(x, t - 2dt), • • • , I(x, t - T)}. 

(16) 

If the relevant region of space is within dx d pixels, then 
this stimulus vector lives in a space of dimension D = 
d 2 T/(dt), which can be enormous. As in the discussion 



above, Eq. (15) is a restatement of the hypothesis that 



only one direction in this space is relevant for determining 
the probability that the simple cell generates a spike, and 
"direction" is once again a Euclidean or linear projection. 

For a complex cell, on the other hand, this single pro- 
jection is inadequate. Complex cells respond primarily 
to oriented edges and gratings, as do simple cells, but 
they have a degree of spatial invariance which means that 
their receptive fields cannot be mapped onto fixed zones 
of excitation and inhibition. Instead, they respond to 
patterns of light in a certain orientation within a large 
receptive field, regardless of precise location, or to move- 
ment in a certain direction. Corresponding to this intu- 
ition, analysis of complex cells using the spike-triggered 
covariance method shows that there is more than one 
relevant dimension [17]. As with non-phase-locked au- 
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ditory neurons, what defeats the simplest version of di- 
mensionality reduction in complex cells is the invariance 
of the response, in this case, invariance to small spatial 
displacement of the relevant, oriented stimulus feature. 

The simplest model of a complex cell is precisely anal- 
ogous to the quadrature pair of filters that emerge in 
the analysis of non-phase-locked auditory neurons. To 
be concrete, let us imagine that receptive fields are de- 
scribed by Gabor patches. The position x includes two 
orthogonal coordinates in visual space, which we call x\ 
and #2- Gabor patches have an oscillatory dependence 
on one of these coordinates, but simply integrate along 
the other; both the integration and the envelope of the 
oscillations are described by Gaussians, so that 



Fi(x, r) = cos(/c£i) exp 



2af 



and the quadrature filter is then 



F 2 (x, r) = sin(fcxi) exp 



2(7? 



2a\ 



2*1 



f(T), (17) 



f(r). (18) 



Each of these filters is maximally sensitive to extended 
features oriented along the X2 direction, but the optimal 



patterns of light and dark are shifted for the two filters; 
for simplicity we have assumed that spatial and temporal 
filtering is separable. If we form the energy-like quantity 



p(t) 



J d 2 x J drFi(x,r) I(x,t) 

J d 2 x J dr F 2 (x,r) I(x,t) 



, (19) 



we have a measure of response to oriented features in- 
dependent of their precise position, and this provides a 
starting point for the analysis of a complex cell. 

One more example of the problem we face is provided 
by the computation of motion in the visual system. There 
is a classical model for this computation, the correlator 
model, that grew out of the experiments by Hassenstein 
and Reichardt on behavioral responses to visual motion 
in beetles and flies [19]. Briefly, the idea behind this 
model is that if something is moving at velocity v, then 
the image intensity I(x, t) must be correlated with the 
intensity at I(x + A,t + r), where r = A/v. Then we 
can detect motion by computing this correlation and av- 
eraging over some window in space and time, 



C A ,rW = J dxW(x) J dt'f(t - t')I(x + A, t')I(x + A, t' - 



r), 



(20) 



where for simplicity, we think just about a single spatial dimension. In principle, with just one value of the delay r 
and one value of the displacement A, this correlation "detects" only motion at one velocity, v = A/r, but we can 
easily generalize this computation to a weighted sum over different values of these spatiotemporal shifts, 



C(t) = J dxW(x) J dt'f(t-t') J dr^F(A,r)J(x + A,t')I(x + A,t' + r). 



(21) 



In the insect visual system it seems natural to think of 
the correlations in Eq. (21) as being computed from the 



Depending on the precise form of this weighting we can arrange for the correlation C to have a relatively smooth, 
graded dependence on the velocity. 

I 

phase. More directly, in this case the brain is trying to 
compute a velocity, more or less independently of the ab- 
solute position of the moving objects. Even in an insect 
visual system, where computing C corresponds to corre- 
lating the filtered outputs of neighboring photoreceptors 
in the compound eye, this computation is repeated across 
an area containing many photoreceptors, and hence there 
is no way to collapse C down to a function of just one or 
two Euclidean projections in the stimulus space. 

What do these examples - non-phase-locked auditory 
neurons, complex cells in the visual cortex, and visual 
motion detectors - have in common? In all three cases, 
the natural, simplest starting point is a model in which 
the brain computes not a linear projection of the stimulus 
onto a receptive field, but rather a quadratic form. More 
precisely, if stimuli are the vectors s = {si, S2? • " ' s d}? 
then Eq's. ([8]), (19), and (21) all correspond to comput- 



outputs of individual photoreceptors, which are typically 
spaced ~ 1° apart. In mammals, we can imagine com- 
puting a similar quantity, but we would use the outputs 
of larger retinal or cortical receptive fields [18 . We can 
also think about this computation in Fourier space. If we 
transform 



I(x,t) 



m 



dcu 
2tt 



I(k,uj) e 



ikx—iujt 



(22) 



then we can think of C as integrating power or energy 
P(k,uj) = \I(k,uj)\ 2 over some region in the k — uj plane; 
motion corresponds to having this power concentrated 
along the line uj = vk. Once again there is an invariance 
to this computation, since as with the non-phase-locked 
auditory neuron we are looking for power regardless of 
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ing a quantity 



D 



s T • Q • s = s\ Qjj s-y 

ij=l 



(23) 



Generalizing the use of terms such as "energy in a fre- 
quency band" and "motion energy," we refer to x as a 
"stimulus energy." 

If the matrix Q is of low rank, this means that we 
can build x out of projections of the stimulus onto a cor- 
respondingly low dimensional Euclidean subspace, and 
we can try to recover the relevant structure using meth- 
ods such as the spike-triggered covariance or maximally 
informative dimensions. Still, if Q is of rank 5 (for ex- 
ample), once we identify the five relevant dimensions we 
have to explore the nonlinear input /output relation of 
the neuron thoroughly enough to recognize whether these 
dimensions combine in a simple way, to recover our in- 
tuition that there is a single stimulus feature x to which 
the cell responds. This can be prohibitively difficult. 

In many cases, it is plausible that neurons could be 
responding just to one energy x, but the underlying ma- 
trix Q could be of full rank; a clear example is provided 
by correlator models for wide-field motion sensitive neu- 
rons, as in [I2j [19]. But in this case there is no real 
"dimensionality reduction" associated with the mapping 
from s — » x, if all we know how to do is to search for 
linear projections or Euclidean subspaces. On the other 
hand, mapping s — > x really is a tremendous reduction 
in complexity, because the full stimulus s is described by 
D parameters, while x is just one number. 

Suppose that the response of a neuron to complex stim- 
uli can be described by saying that the probability of 
spiking depends solely on a single stimulus energy x as 
in Eq. ([23]), so that 



r(t) = r g(sj -Q-s t ). 



(24) 



Our task becomes one of showing that we can recover 
the underlying matrix Q by analyzing the spike train in 
relation to the stimuli, without making any assumptions 
about the statistics of the stimuli. 

As we were finishing this work, we became aware of 
two other recent efforts that point to the importance of 
quadratic forms in stimulus space. It is shown in [27] 
that a logistic dependence of the spike probability on 
a quadratic form is the maximum entropy, and hence 
least structured, model consistent with a measurement 
of the spike-triggered covariance matrix, and that this 
identification is correct without any assumptions about 
the distributions of inputs. As a practical matter, Ref 
[27] emphasizes the case where the matrix kernel (cor- 
responding to Q above) is of low rank, and explores 
examples in which the quadratic form provides an in- 
cremental improvement on linear dimensionality reduc- 
tion. In a different direction, the work in [28 consider 
models in which the spike probability depends exponen- 
tially on a quadratic form, and spiking is explicitly a 



Poisson process. They show that this model, and some 
generalizations, lends itself to a Bayesian formulation, 
in which various simplifying structures (see Section IV) 
can be imposed through prior distributions on the possi- 
ble Q, although it is not clear whether computationally 
tractable priors correspond to realistic models. In some 
limits, these approaches are equivalent to one another, 
and to the search for maximally informative stimulus en- 
ergies that we propose here. As far as we can see, the 
Maximally Informative Stimulus Energy method always 
involves fewer assumptions, and can capture any of the 
structures postulated in the other methods. 



III. CORE OF THE METHOD 

If the probability of generating an action potential de- 
pends on the stimulus s, then observing the arrival of 
even a single spike provides information about the stimu- 
lus. Importantly, the data processing inequality [25] tells 
us that if we look not at the full stimulus but only at some 
limited or compressed description of the stimulus - a sin- 
gle feature, for example - we can only lose information. 
If the neuron really is sensitive to only one stimulus fea- 
ture, however, it is possible to lose none of the available 
mutual information between spikes and stimuli by focus- 
ing on this one feature. This suggests a natural strategy 
for identifying relevant stimulus features, by searching 
for those which preserve the maximum amount of infor- 
mation [6]. Further, we can put the success of such a 
search on an absolute scale, by estimating more directly 
the information that spikes provide about the stimulus 
[20| |2T] , and asking what fraction of this information is 
captured by the best feature we could find. 



A. Setting up the problem 

To make this precise, we recall that the information 
about the stimulus s that is conveyed by the observation 
of a single spike can be written, following [20], as 



■'spike 



d D s P(s| spike) log 2 



P(s| spike) 



bits, (25) 



where P(s) is the stimulus distribution and P(s| spike) 
is the distribution of stimuli given that a spike occurred 
at a particular time (the response conditional ensemble 
[22]). If we consider a mapping M : s — ^ x, then we can 
also compute 



^spike(A^) = / dx P(x | spike) log 2 



P(x| spike) 



P(x) 



, (26) 



knowing that for any mapping M., I S pike(M) < / S pike- If 
we restrict ourselves to some class of mappings, then we 
can search for an M* which maximizes / sp ike(A / l*), and 
see how close this is to the real value of i" S pike- If the 
inputs are Gaussian, then we can find M. by standard 
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spike-triggered or reverse correlation methods. Working 
with arbitrary (i.e., natural) stimulus ensembles compli- 
cates things. Further, if there is just one stimulus feature 
x, then computing I 8 pike(-M) involves only probability 
distributions over a single variable, so there is no curse 
of dimensionality. Finally, in order for this to work, we 
need to make no assumptions about the form of the dis- 
tribution of stimuli P(s), or the inherited distribution of 
stimulus features P{x). Thus, as first emphasized in [6], 
searching for maximally informative features provides a 
practical and principled way to analyze neural responses 
to high dimensional, naturalistic stimuli. 

The work in [6] considered the case where the stim- 
ulus features are one or more linear projections of the 
stimulus, x a = W a -s, so that the mapping M. is pa- 
rameterized by the vectors {W a }; in the simplest case 
there being just one vector W. Here we are interested in 
quadratic mappings, corresponding to the stimulus ener- 
gies in Eq. ( [23] ) . Now the mapping A4 is parameterized 
by a symmetric matrix Q. In principle, all the arguments 
of [6] for the linear case should generalize to the quadratic 
case, and we follow this path. Before starting, we note an 
obvious problem related to the number of parameters we 
are looking for. If we are searching for a vector W in a 
D-dimensional stimulus space, we are looking an object 
described by D numbers. In fact, the length of the vector 
is irrelevant, so that maximizing I S pike(>M) corresponds 
to optimization in a D — 1 dimensional space. But if we 
are searching for a symmetric matrix that acts on stim- 



ulus vectors, there are D(D + l)/2 — 1 free parameters. 
This is a problem both because we have to optimize in 
a space of much higher dimensionality, and because de- 
termining more parameters reliably must require larger 
data sets. We will address these problems shortly, but 
let's start by following the path laid out in [6], which 
involves searching for the maximum of I S pike(M) by gra- 
dient ascent. 



If our stimulus feature is the energy in Eq. (23), then 
the distribution of x is 



p q(%) = J S(x - s T • Q • s) P(s), 



(27) 



where the subscript explicitly denotes that P{x) depends 
on Q. We take the derivative of this with respect to an 
element Qy in the matrix Q, 



dP Q {x) 



d D s 



d 



-I 



dQ Ax-s T -Q-s)P(s) (28) 
d D s s iS j S'(x — s T • Q • s) P(s). (29) 



Similarly, we can differentiate the distribution of x con- 
ditional on a spike, 



dPQ(x\spike) 



d D s SjSj S'(x — s T -Q-s) P(s|spike). 

(30) 

Putting these terms together, we can differentiate the 
information: 



dlspike(M) 



/ 



dx 



WspikeCM) dP Q (x) SIspike(M) <9P Q (x|spike) 



5Pq(x) <9Qij ^P Q (x|spike 

— f ^ Pq(x|spik e) ^ D 
In 2 J P Q (x) 



1 

~Iri2 



dx ( 1 + In 



Pq (a; | spike) 



dP s SiSjS' (x — s T -Q • s)P(s|spike) 



[ dx f d D s s { s } S(x -s T -Q- s)P(s)-^- 
m £ j j tlx 



Pq (a; I spike) 



ln2 7 



dx 



Pq(x|spike^ 



J d D s SiS- } S(x ■ 



s T -Q • s)P(s|spike) 



dx 



P Q (x|spike) 



But we notice that 



d D s S[Sj S(x — s T -Q • s)P(s) = (s[Sj\x) Pq(x), 
where (s[Sj\x) is the expectation value of S[Sj conditional on the value of the stimulus energy x, and similarly 

J d D s SiSjS(x — s T -Q • s)P(s|spike) = (s[S- } \x, spike) Pq(x\ spike), 



(31) 



(32) 



(33) 



(34) 



(35) 



where (sjSj \x, spike) is the expectation value conditional on the energy x and the occurrence of a spike. We can 
combine these terms to give 



d^spikeCM) 



J dx Pq(x) [(siSj \x, spike) — (s[Sj\ 



d 

dx 



Pq(x|spike) 



(36) 
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or, more compactly, 

/d 
cIxPq(x) [(ss T |x, spike) — (ss T |x)] — 

To learn the maximally informative energy, or the best choice of the matrix Q, we can ascend the gradient in successive 
learning steps, 

Q — >• Q + 7 Vq/ where 7 is small. (38) 



Pq(x| spike) 



(37) 



r 



B. Multiple matrices 

In the same way that the idea of linear projection can 
be generalized to have the probability of spiking depend 
on multiple linear projections, we can generalize to the 
case where the are multiple relevant stimulus energies. 
Perhaps the simplest example Eq. ( [24] ) can be generalized 
to 2 matrices, is the computation of a (regularized) ratio 
between two stimulus energies, so that the probability of 
spiking varies as 



Some biological examples of this formulation include gain 
control or normalization in VI [17], optimal estimation 
theory of motion detection in visual neurons of insects 
[T2] and complex spectrotemporal receptive fields of neu- 
rons responsible for song processing in songbirds [23] [24] . 
The inference task becomes one of estimating both ma- 
trices Qi and Q2 by information maximization. 



rog 



Qi s 



Q2 s 



(39) 



As before, we can compute the gradient, noting that this time, there are two different gradients of I S pike(A^), 

~ P Ql (x| spike)" 



V Ql /= JdxP Ql (x) 
Vq 2 / = — dx Pq 2 (x 




ss 



Q2 s 

r Qi 



x, spike 



ss 



Q 2 s 



d_ 

dx 



and 



(l + s T -Q 2 .s) 2 



ss 



x, spike 



Qi 



(1 + s T • q 2 • s y 



ss 





d 


x^ 




dx 



PQ 2 (x|spike) 



(40) 



C. Technical aspects of optimization 



Analogous to Eq. (38), at every learning step, we update each matrix by the appropriate i th gradient, 

Qz^Qz + 7V Qz i. (41) 

where i = 1,2 for the x in Eq. (39). In principle the formalism in Eq. (39) could yield a more complete description 
of a neuron's nonlinear response properties compared to a single quadratic kernel convolving the stimulus, but there 
are some data-requirement challenges which we will address later. 

I 

An example of these ideas is shown in Fig. [3] We used 
very small patches from natural images as inputs, reshap- 
ing the intensities in nearby pixels into a D-component 
stimulus vector s where D = 10. To describe the neuron 
we chose a random symmetric matrix K to use as the 
kernel, and generated spikes when the stimulus energy, 
s T • K • s, crossed a threshold, as illustrated in Fig. [3p. 
We fixed the mean rate of the spiking response such that 
the probability of a spike occurring in one bin of duration 
dt is 0.1, and we generated ~ 1000 spikes. We then tried 
to extract the neuron's receptive field by starting with a 
random initial matrix Q, and following the gradient of 
mutual information, as in Eq. (38). 

We let the one parameter of the algorithm, 7, gradu- 
ally decrease from a starting value of 0.5 to 0.05, in order 
to minimize the fluctuations around the true maximum 



In order to implement Eq. (38) as an algorithm, we 
have to evaluate all the relevant probability distributions 
and integrals. In practice, this means computing x for 
all stimuli, choosing an appropriate binning along the 
x-axis, and sampling the binned versions of the spike- 
triggered and prior distributions. We compute the expec- 
tation values (ss T ) separately for each bin, approximate 
the integrals as sums over the bins, and derivatives as 
differences between neighboring bins. To deal with local 
extrema in the objective function, we use a large start- 
ing value of 7 and gradually decrease 7 during learning. 
This basic prescription can be made more sophisticated, 
but we do not report these technical improvements here. 
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(a) Example (b) Normalized mutual information (c) Reconstruction error 




Time,s Learning steps Learning steps 



FIG. 3. Core of the method, (a) A general implementation is shown here. The stimuli s are natural image clips which are 
D x D pixel patches resized from a natural image database, as described in [26]. 1000 spikes are generated with a probability 
per time bin of 0.1 from the model neuron by a thresholding the term, s T • K • s where the 10 x 10 matrix K is the receptive 
field of the neuron, (b) Mutual information between the spiking response of the model neuron and the quadratic stimulus 
projection x is plotted as a function of the number of learning steps. Information, normalized by its value when K = Q, peaks 
at the 40 th learning step and then plateaus. The 3 black dots on the trace denote the points at which we extract the initial, the 
intermediate and the optimal matrices. The maximally informative matrix Q reconstructed at the 40 th step, agrees well with 
K, indicating convergence. For this implementation the step size 7 = 0.5 at the start and 0.05 at the end of the algorithm, (c) 
Root-mean-square (RMS) reconstruction error calculated as ((K — Q) 2 ) 1//2 , is plotted as a function of the number of learning 
steps. This error decreases steadily until either the randomly initialized matrix (solid line) or the matrix initialized to the 
spike-triggered covariance matrix (dashed line) matches K. If Q is initialized to the covariance matrix, the initial RMS error 
is smaller and the convergence is faster (30 th learning step) compared to that for a randomly initialized Q. For this example, 
both K and Q are 10 x 10 matrices and the black dot on the solid trace is at the same learning step as in panel (b). 



of the information. Mutual information, the red trace 
in Fig. [3)3, peaks at the 40 th learning step and remains 
unchanged after that. The 3 black dots in Fig. [3}d cor- 
respond to the steps during the optimization when we 
extract and plot the initial guess, the intermediate and 
the optimal/maximally informative matrix Q. It is in- 
teresting to note that the intermediate matrix appears 
completely different from the optimal Q even though the 
corresponding mutual information is relatively close to its 
maximum (a similar observation was made in the context 
of maximally informative dimensions [6]). 

In Fig. [3^ the root-mean-square (RMS) reconstruction 
error ((K - Q) 2 ) 1 / 2 is plotted as a function of the num- 
ber of learning steps for a randomly initialized Q (solid 
line) and when Q is initialized to the spike-triggered co- 
variance (STC) matrix (dashed line). RMS error at the 
start of the algorithm w 1 when the "true" matrix K 
and the initial guess for Q are symmetric, random ma- 
trices, uncorrelated with each other, but is slightly lower 
when Q is initialized to the STC. This difference be- 
comes smaller as the stimulus dimensionality D increases 
or as the stimulus departs more strongly from Gaussian- 
ity. Both traces decrease, and stop changing once our 
estimate of the optimal Q matches K. This occurs at 
the 40 th step for the randomly initialized Q and slightly 
sooner (30 th step) when initialized to the STC. If we had 
fewer spikes, our estimate for the optimal Q could still 
match K adequately, but the actual RMS error in recon- 
struction might be higher. We explore such performance 



measures and data requirement issues next. 



D. Performance measures and data requirements 

We assume that spiking responses of the system are 
less frequent than periods of silence and therefore the ac- 
curacy of our reconstruction should depend of the num- 
ber of spikes iV S pik es produced by the model (or recorded 
data in the experimental context), independent of the ac- 
tual threshold. Further, we expect that when the stim- 
ulus dimensionality D increases, the fact that there are 
more parameters needed to describe the kernel K means 
that performance will deteriorate unless we have access to 
more data. To address these issues we explore model neu- 
rons as in Fig. [3j but systematically vary D and AT spikes . 
Again we consider the most difficult case, with natural- 
istic stimuli and kernels that are random symmetric ma- 
trices. The results are shown in Fig. [4] 

Our intuition is that the number of spikes should scale 
with the number of free parameters in the model, D(D + 
l)/2, so we always start with iV S pikes = D{D + \)/2. If we 
normalize the reconstruction errors by this initial error, 
and scale iV S pikes by the number of free parameters, the 
data collapse, as shown in Fig. ^p. Evidently accurate 
reconstructions of very large matrices requires very many 
spikes. This suggests that it will be important to have 
some more constrained models, which we now explore. 
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(a) Reconstruction error vs. N sp jk es 



(b) Data collapse 

1.01 




g spikes 



200 400 600 

2N spikes /D(D+1) 



1000 



FIG. 4. Data requirement and performance issues, (a) Reconstruction error ((K — Q) 2 ) 1//2 is plotted as a function of 
number of spikes (iV sp ikes) for matrices corresponding to stimuli of increasing dimensionality (D = 10,20,40 & 50). (b) The 
traces for different values of D collapse when the error is normalized by the (maximum) value at iV S pikes = D(D + l)/2 for each 
D, and plotted as a function of 2N sp i^ es /(D(D + 1)). 



IV. CONSTRAINED FRAMEWORKS 



orthogonal vectors {vi} such that 



Q = E 



T 

ViV,- . 



(42) 



The most general stimulus energy is described by 
D(D + l)/2 parameters, and this quickly becomes large 
for high dimensional stimuli. In many cases it is plau- 
sible that the matrix kernel of the stimulus energy has 
some simpler structure, which can be used to reduce the 
number of parameters. 



One way to simplify the description is to use a matrix 
that has low rank. If, for example, the rank of the ma- 
trix Q in Eq. (23) is p < D, then we can find a set of 



In terms of these vectors, the stimulus energy is just x = 

ELiK T -s) 2 - 

The low rank approximation reminds us of the sim- 
pler, Euclidean notion of dimensionality reduction dis- 
cussed above. Thus, we could introduce variables X{ = 
• s for i = 1,2, The response would then 

be approximated as depending on all of these variables, 
r(xi, #2, x p), as m Eq. (7f. In the stimulus energy 
approach, all of these multiple Euclidean projections 
are combined into x = J2i x h so ^ na ^ nave a more 
constrained but potentially more tractable description. 
When Q is written as Q = Yli=i v * v ?\ the relevant gra- 



dient of information, analogous to Eq. (37) is 



J 



V Vi / = 2 J dxP Q (x) [(s(v, T s)|x,spike) - (s(v, T s)|x)] 



d_ 

dx 



P Q (x|spike) 



(43) 



and we can turn this into an algorithm for updating our 
estimates of the v^, 



Vi ->> Vi 



•7V Vi I- 



(44) 



There is a free direction for the overall normalization of 
the matrix Q [6j [17] which makes the mutual information 
invariant to reparameterization of the quantities. We or- 
thogonalize the vectors during the learning procedure 
to make sure that they don't grow out of bound. 



r 



Another way of constraining the kernel of the stimulus 
energy is to assume that it is smooth as we move from 
one stimulus dimension to the next. Smooth matrices 
can be expanded into weighted sums of basis functions, 



(a0 



(45) 



and finding the optimal matrix then is equivalent to cal- 
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culating the most informative M-dimensional vector of 
weights. 

The basis can be chosen so that systematically increas- 
ing the number of basis components M allows the recon- 
struction of progressively finer features in Q. For exam- 
ple, we can consider {B^} to be a set of Gaussian bumps 
tiling the DxD matrix Q, and whose scale (standard de- 
viation) is inversely proportional to \[M. For M D 2 /2 
the basis matrix set becomes a complete basis, allowing 
every Q to be exactly represented by the vector of coeffi- 
cients a. In any matrix basis representation, the learning 
rule becomes, 



M 



dl 



B 



(m) 



i,j=i J 



(46) 



This is equivalent to taking projections of our general 
learning rule, Eq. (38), onto the basis elements. 



V. RESULTS FOR MODEL NEURONS 

A. An auditory neuron 

As a first example, we return to the model auditory 
neuron whose response properties from Eq's (|8| and ([9| 
were schematized in Fig. [T] Rather than studying its 
responses to white noise stimuli, however, we consider the 
responses to bird song, as shown in Fig. [5|a). We start 
with Eq. ([8| and see that it is equivalent to a stimulus 
energy with kernel K defined through 

p(t) = J dh J dt 2 s(t 1 )K(t-t 1 ,t-t2)s(t2), (47) 

K = J drf 2 (r)h(t - r - t x )h(t - r - t 2 ). (48) 

We used the same filters as we showed in Fig. [IJb) and 
(e) to construct K, which is plotted in Fig.|5|b). We can 
also look in a mixed time-frequency representation to 
generate a spectrotemporal "sensitivity," K, as follows: 



K(w,t) = J drK(t+ T -,t- T -)e +i 



(49) 



Eq. suggests that K is a function of uj and t, as seen 
in Fig. |5[c). K describes the selectivity of the model 
neuron to the stimulus, in this case the bird song. We 
see that this neuron responds robustly to a sound with 
a frequency around lKHz with a temporal dependence 
dictated by the time constants of the 2 filters that make 
up the neuron's receptive field matrix K. This descrip- 
tion has the flavor of a spectrotemporal receptive field 
(STRF), but in the usual implementations of the STRF 
idea a spectrogram representation is imposed onto the 
stimulus, fixing the shapes of the elementary bins in the 
time-frequency plane. Here, in contrast, Fourier trans- 
forms are in principle continuous, and we don't need to 



assume that the only relevant variable is stimulus power 
in each frequency band. 

The natural stimuli we used to probe this model audi- 
tory neuron's receptive field came from recordings of ze- 
bra finch songs, modified into stimulus clips s. The songs 
were interpolated down from their original sampling rate 
to retain the same discrete time steps (dt = 1 /20000s) 
that we use in Fig.[l] An example sound pressure wave of 
a song stimulus is plotted in Fig.[5|a). In the same panel, 
we emphasize the structural complexity of the stimulus in 
the spectrogram of the same song. The song spectrogram 
shows multiple motifs, significant temporal structure and 
several harmonic stacks, all of which point to the strong 
departure from Gaussianity of s. 

This model neuron, as illustrated in Fig. [I] emitted 
spikes when the power in Eq. (48) exceeded a threshold. 



We set the threshold of firing so that the mean spike 
rate was 20s _1 . We presented ~ 42 minutes of bird song 
stimuli to the model neuron, collecting roughly 50, 000 
spikes. 

We follow the gradient ascent procedure exactly as 
described in Eq. (38). Note that K in this model is a 



300 x 300 matrix, and we make no further assumptions 
about its structure; we start with a random initial con- 
dition (Fig. [5]i). The maximally informative matrix that 
we find is shown in Fig. [5^, and is in excellent agreement 
with the matrix K; we can see this in the frequency do- 
main as well, shown in Fig. [5]f. Quantitatively, the RMS 
reconstruction error for this inference is less than 5% of 
the maximum for any two uncorrelated random, symmet- 
ric matrices of the same size. 



B. Complex cell in the visual cortex 

We consider the model complex cell described earlier 
in Eq. (19), but allow integration over just one frame 
of a movie, so we don't need to describe the temporal 
filter. We chose parameters k = 27r/3, g\ = 1.6 and a 2 = 
5, with positions measured in pixels. The stimuli were 
20,000 grayscale 30 x 30 pixel image patches extracted 
from a calibrated natural image database [26]. Spikes 
were generated with a probability of a spike/bin of 0.1 
whenever the stimulus power p exceeded a threshold. We 
note that, in this model, the kernel is explicitly of rank 
2, and so we followed the algorithm in Eq. (43). The 
results are shown in Fig.[6j As expected, the best possible 
reconstruction is a vector pair vi , v 2 that is equal to the 
pair Fi, F 2 up to a rotation. Fig. [6] shows that this is 
indeed the case for the reconstructed filters vi, V2, which 
otherwise match the true filters well. 

Suppose that the real neuron, as in our model, is de- 
scribed by a kernel of rank 2, but we don't know this and 
hence search for a kernel of higher rank. As shown in 
Fig. [7]3, higher rank fits do not increase the information 
that we capture, either for random stimuli or for natural 
stimuli. Interestingly, however, the "extra" components 
of our model are not driven to zero, but appear as (re- 
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(a) Birdsong and spectrogram 
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FIG. 5. Analyzing the responses of the model auditory neuron to a bird song, (a) The sound pressure wave of 
a zebra finch song used as stimulus to the model neuron is shown along with its spectrogram. The spectrogram of the song 
illustrates that s is highly structured, full of harmonic stacks and complex spectrotemporal motifs, (b) The equivalent matrix 
K, constructed from the two filters as described in Eq. ^ is 300 x 300 in size but has a relatively simple structure, (c) Taking 
a Fourier transform over ti of K yields a spectrotemporal sensitivity matrix, K with a peak at approximately lKHz. (d) 
The initial guess for Q is the random symmetric matrix plotted here, (e) The optimal matrix Q that maximizes the mutual 
information between the spiking response of the model neuron and the ID projection x — s T • Q • s matches K well at the end 
of 100 learning steps, (f) The spectrotemporal sensitivity Q, corresponding to the maximally informative stimulus energy has 
the same response preferences as K. 



dundant) linear combinations of the two true underlying 
vectors, so that the algorithm still finds a genuinely two 
dimensional, albeit over-complete, solution. 



The convergence of a full rank matrix Q = 
Y^i=i v i v ? + where 77 is Gaussian random noise of 
O(D), to the model matrix K can be determined by 
looking at the projections of the leading eigenvectors of 
the matrix Q at the end of the maximization algorithm. 
In this complex cell example where spikes are generated 
from a rank 2 matrix, the two eigenvectors correspond- 
ing to the two leading eigenvalues for the fit, Q should 
be identical to what used to be vi and V2 before. The 
remaining eigenvalues should be driven to be 0, and this 
is indeed what we see in Fig. [7]i. 



C. Matrix basis formalism 

We illustrate this simplification by making use of the 
matrix basis expansion from Eq. ( [45] ) to infer a matrix 
K that is of arbitrarily high rank. For K we used a sym- 
metrized 250 x 250 pixel image of a fluid jet as shown 
in Fig. [8^1. While this is not an example of a recep- 
tive field from biology, it illustrates the validity of our 
approach even when the response has an atypical and 
complex dependence on the stimulus. Spikes were gener- 
ated by thresholding the energy s T • K • s, and the same 
naturalistic visual stimulus ensemble was used as before. 
Gaussian basis matrices shown in the inset of Fig. [HJd 
were used to represent the quadratic kernel, reducing the 
number of free parameters from ~6 x 10 4 to M = 225. 
We start the gradient ascent with a large 7 value of 1 and 
progressively scale it down to 0.1 near the end of the algo- 
rithm; Fig. [8b shows the information plateauing in about 
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(a) Mutual information 



(b) Projection of RFs 





Learning steps 

(c) Component receptive field s of model complex cell 
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(d) Maximally informative receptive fields 
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FIG. 6. Receptive fields of a model complex cell and 
the reconstructed maximally informative pair, (a) In- 
formation as a function of the number of learning steps peaks 
and then plateaus. The black dot is the point where the re- 
constructed receptive fields are shown in panel (d) below, (b) 
Reconstructed vectors vi, V2 are rotated versions of the re- 
ceptive fields Fi, F2, but span the same linear subspace (all 
vectors are normalized to unit length), (c) The receptive field 
of t he m odel complex cell is given by the two linear filters in 
Eq. (19]): Fi (left) and F 2 (right), (d) The reconstructed re- 
ceptive fields at the 100 th learning step (black dot in panel 
(a) above) with filters vi (left) and v 2 (right) rotated to best 
align with the F\ - F2 pair. 



40 learning steps. The maximally informative quadratic 
kernel Q reconstructed from these 225 basis coefficients 
is shown in Fig. [SJ^. Optimizing the 225 basis functions 
captures the overall structure of the kernel but this can 
be improved to an almost perfect reconstruction (at a 
pixel-by-pixel resolution) by increasing M, as shown in 
Fig. §1. 



K 2 , are plotted in Fig. |9^l, along with a schematic of the 
spikes produced when the nonlinearity g(-) is a thresh- 
old. Again we constructed stimuli from nearby pixels of 
natural images, so that the distribution is strongly cor- 
related and non-Gaussian; the threshold was set so that 
the probability of a spike/bin to 0.3, and we generated 
~ 10,000 spikes. We followed the algorithm in Eq's (40) 



and (41) to find the maximally informative matrices Qi 



and Q2. The original and the optimal matrices are plot- 
ted in Fig. [9)3, where see that for a model neuron with 
a randomly initialized duet of matrices, the optimal ma- 
trices agree with those of the model neuron. Mutual 
information in red, normalized by the maximum when 
Ki = Qi and K2 = Q2, and RMS reconstruction error 
in green (calculated as ([(K x + K 2 ) - (Qi + Q 2 )] 2 ) 1/2 ) 
are plotted as a function of the learning steps in Fig. |9j3. 
While convergence is definitely possible, our estimates of 
the maximally informative matrices are noisier than in 
the single matrix instances, even with a relatively large 
amount of data, as indicated in Fig. [9]i. We expect that 
realistic searches for multiple stimulus energies will re- 
quire us to impose some simplifying structure on the un- 
derlying matrices. 



VI. DISCUSSION 

While the notion that neurons respond to multiple pro- 
jections of the stimulus onto orthogonal filters is power- 
ful, it has been difficult to develop a systematic frame- 
work to infer a neuron's response properties when there 
are more than two filters. To get around this limita- 
tion, we propose an alternative model in which the neural 
response is characterized by features that are quadratic 
functions of the stimulus. In other words, instead of be- 
ing described by multiple linear filters, the selectivity of 
the neuron is described by a single quadratic kernel. The 
choice of a quadratic form is motived by the fact that 
many neural phenomena previously studied in isolation 
can be viewed as instances of quadratic dependences on 
the stimulus. We presented a method for inferring maxi- 
mally informative stimulus energies based on information 
maximization. We make no assumptions about how the 
quadratic projections onto the resulting matrices map 
onto patterns of spiking and silence in the neuron. This 
approach yields unbiased estimates for receptive fields for 
arbitrary ensembles of stimuli, but requires optimization 
in a possibly rugged information landscape. The methods 
we have presented should help elucidate how sensitivity 
to high-order statistical features of natural inputs arises 
in sensory cortical areas. 



D. Multiple matrices 
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(a) Over-fitting the model complex cell (c) Information saturates at rank of complex cell 




Rank 



FIG. 7. Over-fitting the model complex cell with matrices of successively increasing rank, (a) Receptive fields 
reconstructed after mutual information is maximized with matrices of rank p — 2,3,4 and 5 (from left to right), (b) The 
resulting vectors, vi through V5, at the end of the information maximization are no longer orthogonal but project fully into a 
unit circle in the F1-F2 plane, (c) Maximum mutual information as a function of the rank of fit, p, for random stimuli (open 
circles) or for the stimulus matrix generated using natural scenes (filled circles), peaks at the rank equal to that of the "data" 
(rank p — 2 for the model complex cell), and remains unchanged as the rank of Q increases, (d) Over-fitting the model matrix 
K with Gaussian noise does not add to the mutual information Iq and the algorithm successfully finds a two dimensional 
solution. Eigenvalue profile of matrix Q = J^SLi v * v ? + ? 7 where 77 is a 30 x 30 sized- A/"(0, 1) after maximizing information with 
respect to the complex cell. Aside from two leading eigenvalues with magnitude 1, the rest — > 0. 
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FIG. 9. Inferring 2 MISE's simultaneously, (a) Schematic of a model neuron with divisive gain control as described 
in the text. Natural images were reshaped and used as stimulus clips, s. The "true" matrices Ki and K2 were generated 
as random, symmetric 5x5 matrices, (b) The true (Ki and K2, top panel) and maximally informative matrices (Qi 
and Q2, bottom panel) are plotted here, (c) Mutual information (red) normalized by the maximum when Ki = Qi and 
K2 = Q2, peaks at the 80 th learning step and remains unchanged after. RMS reconstruction error (green), computed as 
([(Ki +K2) — (Qi + Q2)] 2 ) 1 ^ 2 decays to a steady low at the same step. Using more spikes will decrease this further, (d) 
The data requirement for simultaneously extracting 2 matrices Qi and Q2 to a precision of 30% is shown in the shaded 
region. iV S pikes here is proportional to the number of free parameters ~ D\ + D\, where D\ and D2 correspond to the stimulus 
dimensionality in the numerator and the denominator of x, respectively. 



